This population study examined employee retention in a North American acute care hospital over a 10-year period to determine which factors influence employee turnover. The hospital was interested in building a model that could predict which employees had a higher likelihood of leaving the organization, so that timely interventions could be executed to improve employee retention.
The main questions that needed to be addressed were -
What factors affect voluntary employee turnover?
How can employee turnover be predicted?
How can voluntary, regrettable, avoidable employee turnover be reduced?
HR experts were consulted, and HR theories such as Human Capital theory, Social Identity theory, Expectancy theory and Distributive Justice theory were considered while carefully selecting the factors that may affect employee turnover. The goal of the project was to build an analytical model that was simple enough for HR executives to understand, yet actionable enough to provide clear insights and assist in developing a retention campaign.
Employee termination was classified into the following types -
VTO-R-A - Voluntary turnover - Regrettable- Avoidable
VTO-R-U - Voluntary turnover - Regrettable - Unavoidable
VTO-N - Voluntary turnover - NonRegrettable
ITO - Involuntary turnover
RIF - Reduction in force
The termination type of interest was VTO-R-A - Voluntary turnover - Regrettable - Avoidable
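Given this taxonomy, the binary modeling target can be derived by keeping only active employees and VTO-R-A terminations. A minimal sketch of that derivation, assuming a richer termination-type field (`TermType` is a hypothetical column name, not in the data set shown below):

```r
# Hypothetical derivation of the binary Status target from a termination-type
# field: active employees vs. VTO-R-A terminations; all other termination
# types are dropped from the modeling set.
training$Status <- ifelse(is.na(training$TermType), "ACTIVE",
                   ifelse(training$TermType == "VTO-R-A", "TERM", NA))
training <- subset(training, !is.na(Status))
training$Status <- factor(training$Status)
```

This mirrors the sampling described later: the modeling records consist mainly of VTO-R-A turnover employees plus randomly selected active employees.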
Factors selected for analysis were grouped into 3 main categories -
Socio-Demographic variables -
ZipCode - Zipcode the employee resides in
Sex - Gender of employee
Eeoc - Race/Ethnicity of employee
Age - Current age of employee
EntryAge - Age employee joined organization
CareerAge - Age at termination if terminated else current age
Education Level - 1 - No HS; 2 - Some HS; 3 - HS; 4 - GED; 5 - Some College; 6 - Associate Degree; 7 - Under Grad; 8 - Graduate; 9 - PHD; 10 - Post Doctoral
Experience - Number of years employed prior to joining organization
Career stage - based on the Levinson model - 1 - <25; 2 - 25-35; 3 - 35-50; 4 - 50+
Service Length - Factor levels based on length of total employed years - 1 - <1; 2 - 1-2; 3 - 3-5; 4 - 5-10; 5 - 11-20; 6 - 20+
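The banded variables above can be derived from the raw values with base R's `cut()`. A minimal sketch, assuming `Age` and `TotalEmployment` are numeric columns as in the data set below (the break points follow the definitions above; the Service Length bands as listed have small gaps, so the breaks here are an approximation):

```r
# Derive CareerStage (Levinson bands) and ServiceLength codes from raw values.
# labels = FALSE makes cut() return the integer band index directly.
training$CareerStage <- cut(training$Age,
    breaks = c(-Inf, 25, 35, 50, Inf), labels = FALSE)           # 1: <25 ... 4: 50+
training$ServiceLength <- cut(training$TotalEmployment,
    breaks = c(-Inf, 1, 2, 5, 10, 20, Inf), labels = FALSE)      # 1: <1 ... 6: 20+
```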
Work variables -
Start Rate - hourly rate at which employee started their job at organization
BaseRate - Current base hourly rate of employee
SalaryProg - Average increase in salary per employed year
StartGrade - Grade employee started their job at
Grade - Current grade of employee
JobLevelProg - Average increase in grade per employed year
StartEval - Evaluation score of employee at start of job at organization
LastEval - Last evaluation score of employee (1 - 10 overall score)
AvgEval - Average evaluation score of employee
Tenure - Number of years employed at current organization
TotalEmployment - Total years employed since start of career
LastPromotion - Number of years since last promotion
Training - Amount spent on employee on education and training
Unscheduled - Number of unscheduled leaves taken in the last year
Shift - If the employee works on shifts (1 - No shift; 2 - Shift; 3 - Late night shifts)
Psychometric variables -
Employee Engagement Index - Measure to determine how engaged an employee is with their work and organization. A 1 - 10 score based on employee survey and manager evaluation
Employee Commitment Index - Measure to determine how committed an employee is to the organization and their job. A 1 - 10 score based on employee survey and manager evaluation
Other factors that were considered important but were left out due to lack of reliable data or client reluctance to include them in the model -
Supervisor/Manager of Employee
Job fit index
Market opportunity index
Work/Life imbalance
Stress Index
2000 employee records spanning a 10-year period were selected for model building. The records consisted mainly of voluntary, regrettable and avoidable turnover employees, together with a random selection of active employees. A summary and sample of the data can be seen below -
library(rpart.plot)
## Loading required package: rpart
library(rpart)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(car)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
setwd("~/Documents/BI-Dashboards")
training <- read.csv("HRAnalytics_Model.csv")
dim(training)
## [1] 2000 29
summary(training)
## EmployeePPPER Status ZipCode Sex Eeoc
## Min. : 1.0 ACTIVE: 900 75028 :317 F:1679 ASIAN : 9
## 1st Qu.: 503.8 TERM :1100 75032 :210 M: 321 BLACK : 6
## Median :1007.5 75018 :208 HISPANIC: 4
## Mean :1018.2 75022 :207 WHITE :1981
## 3rd Qu.:1549.2 75042 :207
## Max. :2052.0 75012 :105
## (Other):746
## Shift BaseRate Grade Training
## Min. :1.000 Min. : 0.00 Min. : 0.00 Min. : 0
## 1st Qu.:1.000 1st Qu.: 8.14 1st Qu.: 3.75 1st Qu.:3750
## Median :1.000 Median : 11.73 Median : 6.00 Median :5000
## Mean :1.266 Mean : 15.09 Mean :18.48 Mean :4074
## 3rd Qu.:1.000 3rd Qu.: 17.35 3rd Qu.:37.00 3rd Qu.:5000
## Max. :3.000 Max. :185.10 Max. :99.00 Max. :5000
##
## LastEval StartEval Education StartGrade
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 3.000 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.: 2.00
## Median : 5.000 Median : 3.000 Median : 1.000 Median : 6.00
## Mean : 5.171 Mean : 3.192 Mean : 2.026 Mean :15.96
## 3rd Qu.: 7.000 3rd Qu.: 5.000 3rd Qu.: 4.000 3rd Qu.:29.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :95.00
##
## StartRate Experience EngagementIndex CommitmentIndex
## Min. : 0.04 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.: 0.00 1st Qu.: 3.000 1st Qu.: 4.000
## Median : 6.00 Median : 7.00 Median : 5.000 Median : 5.000
## Mean : 11.34 Mean :10.63 Mean : 5.121 Mean : 5.295
## 3rd Qu.: 12.51 3rd Qu.:18.00 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :177.93 Max. :57.00 Max. :10.000 Max. :10.000
##
## Unscheduled Tenure Age EntryAge
## Min. : 0.000 Min. : 0.000 Min. :16.00 Min. :14.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.:30.00 1st Qu.:21.00
## Median : 5.000 Median : 4.000 Median :41.00 Median :28.00
## Mean : 4.896 Mean : 7.868 Mean :41.75 Mean :30.94
## 3rd Qu.: 7.000 3rd Qu.:11.000 3rd Qu.:52.00 3rd Qu.:39.00
## Max. :10.000 Max. :52.000 Max. :85.00 Max. :78.00
##
## CareerAge CareerStage ServiceLength SalaryProg
## Min. :15.00 Min. :1.000 Min. :1.000 Min. :-1.750
## 1st Qu.:26.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 0.220
## Median :38.00 Median :3.000 Median :3.000 Median : 0.540
## Mean :38.84 Mean :2.614 Mean :3.419 Mean : 1.246
## 3rd Qu.:50.00 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.: 1.452
## Max. :83.00 Max. :4.000 Max. :6.000 Max. : 9.950
##
## JobLevelProg AvgEval LastPromotion TotalEmployment
## Min. :-8.0000 Min. : 0.000 Min. :-2.000 Min. : 0.0
## 1st Qu.: 0.0000 1st Qu.: 3.000 1st Qu.: 1.000 1st Qu.: 6.0
## Median : 0.3300 Median : 4.000 Median : 2.000 Median :17.0
## Mean : 0.6804 Mean : 4.182 Mean : 4.369 Mean :18.5
## 3rd Qu.: 1.2900 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.:29.0
## Max. :10.0000 Max. :10.000 Max. :51.000 Max. :61.0
##
head(training)
## EmployeePPPER Status ZipCode Sex Eeoc Shift BaseRate Grade Training
## 1 1 TERM 75018 F WHITE 1 18.47 45 5000
## 2 2 ACTIVE 75002 F WHITE 1 17.35 25 5000
## 3 3 ACTIVE 75018 F WHITE 2 24.95 55 5000
## 4 4 ACTIVE 75062 F WHITE 3 17.35 25 5000
## 5 5 ACTIVE 75020 F WHITE 1 11.70 6 5000
## 6 6 TERM 75042 M WHITE 1 22.43 58 5000
## LastEval StartEval Education StartGrade StartRate Experience
## 1 1 2 5 40 15.66 0
## 2 4 0 3 21 14.44 1
## 3 2 3 6 48 22.13 5
## 4 4 1 3 19 11.65 3
## 5 0 5 1 5 6.00 0
## 6 1 4 6 57 17.51 4
## EngagementIndex CommitmentIndex Unscheduled Tenure Age EntryAge
## 1 2 4 7 17 48 21
## 2 10 9 4 14 36 22
## 3 0 3 10 29 54 26
## 4 3 6 5 12 36 24
## 5 8 6 7 29 44 15
## 6 8 6 0 2 37 25
## CareerAge CareerStage ServiceLength SalaryProg JobLevelProg AvgEval
## 1 38 3 5 0.17 0.29 1.5
## 2 36 3 5 0.21 0.29 2.0
## 3 54 4 6 0.10 0.24 2.5
## 4 36 3 5 0.48 0.50 2.5
## 5 44 3 6 0.20 0.03 2.5
## 6 27 2 3 2.46 0.50 2.5
## LastPromotion TotalEmployment
## 1 17 17
## 2 2 15
## 3 11 34
## 4 12 15
## 5 11 29
## 6 2 6
The main variables were grouped by category and scatter plots were constructed to analyse relationships among variables.
scatterplotMatrix(~Shift+BaseRate+Grade+LastEval+LastPromotion+Age|Status,data=training)
scatterplotMatrix(~Education+StartGrade+StartRate+StartEval+Experience|Status,data=training)
scatterplotMatrix(~SalaryProg+JobLevelProg+AvgEval+Tenure+CareerAge|Status,data=training)
scatterplotMatrix(~EngagementIndex+CommitmentIndex+Unscheduled|Status,data=training)
The scatterplots reveal relationships among the variables and their impact on Status - active vs. terminated employees.
Since this is a binomial prediction of active vs. terminated employees, a decision tree is a well-suited method for constructing a prediction model.
We divide the data set into training and validation sets. The training set is then further divided into training and test sets. Various models are trained on the training set and evaluated on the test set. The best model is then validated on the validation set and measured for accuracy.
partition <- createDataPartition(y = training$Status, p = 0.8, list = FALSE)
testing <- training[-partition, ]   # hold out the validation set before overwriting training
training <- training[partition, ]
partition <- createDataPartition(y = training$Status, p = 0.6, list = FALSE)
traindata <- training[partition, ]
testdata <- training[-partition, ]
After partitioning the data sets, the rpart method is used to construct a decision tree and check its accuracy -
tree <- rpart(Status ~ ZipCode+Sex+Eeoc+Shift+BaseRate+Grade+Training+LastEval+StartEval+Education+StartGrade+StartRate+Experience+EngagementIndex+CommitmentIndex+Unscheduled+Tenure+Age+EntryAge+CareerAge+CareerStage+ServiceLength+SalaryProg+JobLevelProg+AvgEval+LastPromotion+TotalEmployment, data=traindata)
tree
## n= 960
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 960 432 TERM (0.45000000 0.55000000)
## 2) LastPromotion>=0.5 798 366 ACTIVE (0.54135338 0.45864662)
## 4) BaseRate>=9.205 601 222 ACTIVE (0.63061564 0.36938436)
## 8) Age< 62.5 536 179 ACTIVE (0.66604478 0.33395522)
## 16) Tenure>=3.5 397 106 ACTIVE (0.73299748 0.26700252)
## 32) CommitmentIndex< 5.5 212 39 ACTIVE (0.81603774 0.18396226)
## 64) CareerAge>=49.5 79 6 ACTIVE (0.92405063 0.07594937) *
## 65) CareerAge< 49.5 133 33 ACTIVE (0.75187970 0.24812030)
## 130) Age< 49.5 122 22 ACTIVE (0.81967213 0.18032787) *
## 131) Age>=49.5 11 0 TERM (0.00000000 1.00000000) *
## 33) CommitmentIndex>=5.5 185 67 ACTIVE (0.63783784 0.36216216)
## 66) ZipCode=75012,75020,75028,75032,75038,NULL 71 11 ACTIVE (0.84507042 0.15492958) *
## 67) ZipCode=75001,75002,75011,75018,75022,75042,75055,75062 114 56 ACTIVE (0.50877193 0.49122807)
## 134) Shift< 1.5 97 42 ACTIVE (0.56701031 0.43298969)
## 268) EngagementIndex< 5.5 28 6 ACTIVE (0.78571429 0.21428571) *
## 269) EngagementIndex>=5.5 69 33 TERM (0.47826087 0.52173913)
## 538) JobLevelProg< 0.175 19 5 ACTIVE (0.73684211 0.26315789) *
## 539) JobLevelProg>=0.175 50 19 TERM (0.38000000 0.62000000) *
## 135) Shift>=1.5 17 3 TERM (0.17647059 0.82352941) *
## 17) Tenure< 3.5 139 66 TERM (0.47482014 0.52517986)
## 34) Experience>=28.5 19 3 ACTIVE (0.84210526 0.15789474) *
## 35) Experience< 28.5 120 50 TERM (0.41666667 0.58333333)
## 70) Age< 31.5 43 17 ACTIVE (0.60465116 0.39534884)
## 140) ZipCode=75011,75012,75018,75022,75028,75032,75042,75055,75062 36 10 ACTIVE (0.72222222 0.27777778) *
## 141) ZipCode=75001,75002,75038,NULL 7 0 TERM (0.00000000 1.00000000) *
## 71) Age>=31.5 77 24 TERM (0.31168831 0.68831169) *
## 9) Age>=62.5 65 22 TERM (0.33846154 0.66153846)
## 18) CareerAge>=62.5 44 22 ACTIVE (0.50000000 0.50000000)
## 36) Age< 68.5 16 2 ACTIVE (0.87500000 0.12500000) *
## 37) Age>=68.5 28 8 TERM (0.28571429 0.71428571)
## 74) CareerAge>=70.5 7 1 ACTIVE (0.85714286 0.14285714) *
## 75) CareerAge< 70.5 21 2 TERM (0.09523810 0.90476190) *
## 19) CareerAge< 62.5 21 0 TERM (0.00000000 1.00000000) *
## 5) BaseRate< 9.205 197 53 TERM (0.26903553 0.73096447)
## 10) Age< 19.5 18 2 ACTIVE (0.88888889 0.11111111) *
## 11) Age>=19.5 179 37 TERM (0.20670391 0.79329609)
## 22) CareerAge>=19.5 136 37 TERM (0.27205882 0.72794118)
## 44) Age< 22.5 17 4 ACTIVE (0.76470588 0.23529412) *
## 45) Age>=22.5 119 24 TERM (0.20168067 0.79831933) *
## 23) CareerAge< 19.5 43 0 TERM (0.00000000 1.00000000) *
## 3) LastPromotion< 0.5 162 0 TERM (0.00000000 1.00000000) *
printcp(tree)
##
## Classification tree:
## rpart(formula = Status ~ ZipCode + Sex + Eeoc + Shift + BaseRate +
## Grade + Training + LastEval + StartEval + Education + StartGrade +
## StartRate + Experience + EngagementIndex + CommitmentIndex +
## Unscheduled + Tenure + Age + EntryAge + CareerAge + CareerStage +
## ServiceLength + SalaryProg + JobLevelProg + AvgEval + LastPromotion +
## TotalEmployment, data = traindata)
##
## Variables actually used in tree construction:
## [1] Age BaseRate CareerAge CommitmentIndex
## [5] EngagementIndex Experience JobLevelProg LastPromotion
## [9] Shift Tenure ZipCode
##
## Root node error: 432/960 = 0.45
##
## n= 960
##
## CP nsplit rel error xerror xstd
## 1 0.181713 0 1.00000 1.00000 0.035681
## 2 0.048611 2 0.63657 0.64815 0.032600
## 3 0.032407 3 0.58796 0.63194 0.032355
## 4 0.023148 4 0.55556 0.58796 0.031637
## 5 0.020833 6 0.50926 0.55787 0.031100
## 6 0.016204 7 0.48843 0.55093 0.030970
## 7 0.013889 8 0.47222 0.54167 0.030793
## 8 0.011574 10 0.44444 0.52778 0.030521
## 9 0.010417 11 0.43287 0.53704 0.030704
## 10 0.010185 13 0.41204 0.53704 0.030704
## 11 0.010000 20 0.33333 0.52546 0.030475
prp(tree)
The root node error is 45%, and the lowest cross-validated relative error in the table above is about 0.53, which implies a misclassification rate of roughly 0.53 x 0.45 = 24%. This is not a good enough model for our prediction.
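One common refinement (not part of the original analysis) is to prune the tree at the complexity parameter with the lowest cross-validated error from the `printcp` table. A sketch:

```r
# Select the cp value with the lowest cross-validated error (xerror)
# from the fitted tree's cp table, then prune and re-plot.
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best.cp)
prp(pruned)
```

With the table above this selects cp = 0.01, i.e. the full tree; applying the one-standard-error rule instead would prune more aggressively.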
Next the randomForest method is applied. Random forests are among the best-performing classification models in use today.
model.rf<-randomForest(traindata[,3:29],traindata$Status)
model.rf
##
## Call:
## randomForest(x = traindata[, 3:29], y = traindata$Status)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 18.54%
## Confusion matrix:
## ACTIVE TERM class.error
## ACTIVE 354 78 0.1805556
## TERM 100 428 0.1893939
training_pred <- predict(model.rf, traindata)
confusionMatrix(training_pred, traindata$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ACTIVE TERM
## ACTIVE 405 52
## TERM 27 476
##
## Accuracy : 0.9177
## 95% CI : (0.8985, 0.9343)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8346
## Mcnemar's Test P-Value : 0.00693
##
## Sensitivity : 0.9375
## Specificity : 0.9015
## Pos Pred Value : 0.8862
## Neg Pred Value : 0.9463
## Prevalence : 0.4500
## Detection Rate : 0.4219
## Detection Prevalence : 0.4760
## Balanced Accuracy : 0.9195
##
## 'Positive' Class : ACTIVE
##
testdata_pred <- predict(model.rf, testdata)
confusionMatrix(testdata_pred, testdata$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ACTIVE TERM
## ACTIVE 232 66
## TERM 56 286
##
## Accuracy : 0.8094
## 95% CI : (0.7768, 0.8391)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6161
## Mcnemar's Test P-Value : 0.4152
##
## Sensitivity : 0.8056
## Specificity : 0.8125
## Pos Pred Value : 0.7785
## Neg Pred Value : 0.8363
## Prevalence : 0.4500
## Detection Rate : 0.3625
## Detection Prevalence : 0.4656
## Balanced Accuracy : 0.8090
##
## 'Positive' Class : ACTIVE
##
varImpPlot(model.rf,sort=TRUE,n.var=27)
As can be seen, the accuracy on the training data is about 92%, the out-of-bag error estimate is about 18.5%, and the test data accuracy is about 81%. This is better than the previous result, but still not quite up to par for our requirements.
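Before resampling, the forest's mtry parameter (number of variables tried at each split) could also be tuned directly. A hedged sketch using `randomForest::tuneRF` - the seed and tuning settings here are illustrative, not part of the original run:

```r
# Search over mtry values, doubling at each step, and keep the forest
# built with the best mtry found (doBest = TRUE).
set.seed(42)  # hypothetical seed for reproducibility
tuned <- tuneRF(traindata[, 3:29], traindata$Status,
                ntreeTry = 500, stepFactor = 2, improve = 0.01,
                doBest = TRUE)
print(tuned)
```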
Next, randomForest is trained with 10-fold cross-validation via caret.
model <- train(traindata[,3:29],traindata$Status, method = "rf", prox = TRUE, trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE))
training_pred <- predict(model, traindata)
confusionMatrix(training_pred, traindata$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ACTIVE TERM
## ACTIVE 432 0
## TERM 0 528
##
## Accuracy : 1
## 95% CI : (0.9962, 1)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00
## Specificity : 1.00
## Pos Pred Value : 1.00
## Neg Pred Value : 1.00
## Prevalence : 0.45
## Detection Rate : 0.45
## Detection Prevalence : 0.45
## Balanced Accuracy : 1.00
##
## 'Positive' Class : ACTIVE
##
testdata_pred <- predict(model, testdata)
confusionMatrix(testdata_pred, testdata$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ACTIVE TERM
## ACTIVE 235 66
## TERM 53 286
##
## Accuracy : 0.8141
## 95% CI : (0.7817, 0.8435)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6259
## Mcnemar's Test P-Value : 0.2713
##
## Sensitivity : 0.8160
## Specificity : 0.8125
## Pos Pred Value : 0.7807
## Neg Pred Value : 0.8437
## Prevalence : 0.4500
## Detection Rate : 0.3672
## Detection Prevalence : 0.4703
## Balanced Accuracy : 0.8142
##
## 'Positive' Class : ACTIVE
##
varImp(model)
## rf variable importance
##
## only 20 most important variables shown (out of 27)
##
## Overall
## Age 100.00
## BaseRate 98.05
## LastPromotion 87.23
## Tenure 73.54
## ZipCode 69.43
## CareerAge 33.63
## SalaryProg 32.41
## ServiceLength 28.35
## TotalEmployment 28.24
## StartRate 27.09
## StartGrade 24.14
## JobLevelProg 23.37
## Grade 19.89
## EntryAge 19.13
## EngagementIndex 18.90
## CommitmentIndex 18.13
## Unscheduled 16.85
## AvgEval 15.82
## Experience 15.41
## LastEval 14.30
Finally, we apply the model to our validation set to test the prediction. The confusion matrix, along with a list of employees that were incorrectly predicted, is included for further analysis.
testing_pred <- predict(model, testing)
confusionMatrix(testing_pred, testing$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ACTIVE TERM
## ACTIVE 144 14
## TERM 12 163
##
## Accuracy : 0.9219
## 95% CI : (0.8877, 0.9484)
## No Information Rate : 0.5315
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8433
## Mcnemar's Test P-Value : 0.8445
##
## Sensitivity : 0.9231
## Specificity : 0.9209
## Pos Pred Value : 0.9114
## Neg Pred Value : 0.9314
## Prevalence : 0.4685
## Detection Rate : 0.4324
## Detection Prevalence : 0.4745
## Balanced Accuracy : 0.9220
##
## 'Positive' Class : ACTIVE
##
testing.pred<-cbind(testing,testing_pred)
subset(testing.pred, Status!=testing_pred, select=c("EmployeePPPER","Status","testing_pred"))
## EmployeePPPER Status testing_pred
## 18 18 TERM ACTIVE
## 119 118 TERM ACTIVE
## 377 378 TERM ACTIVE
## 405 407 ACTIVE TERM
## 424 426 TERM ACTIVE
## 719 724 TERM ACTIVE
## 724 729 TERM ACTIVE
## 742 749 TERM ACTIVE
## 808 815 TERM ACTIVE
## 843 850 ACTIVE TERM
## 854 861 TERM ACTIVE
## 866 873 ACTIVE TERM
## 896 903 TERM ACTIVE
## 1036 1043 TERM ACTIVE
## 1090 1097 TERM ACTIVE
## 1167 1174 ACTIVE TERM
## 1349 1358 ACTIVE TERM
## 1519 1568 ACTIVE TERM
## 1543 1592 ACTIVE TERM
## 1719 1769 ACTIVE TERM
## 1763 1814 TERM ACTIVE
## 1811 1863 TERM ACTIVE
## 1828 1880 ACTIVE TERM
## 1928 1980 ACTIVE TERM
## 1995 2047 ACTIVE TERM
## 2000 2052 ACTIVE TERM
We obtain 92% accuracy on the validation data set, which is a good prediction rate.
As can be seen from both randomForest models, the top 5 factors affecting terminations are -
Age
BaseRate
LastPromotion
Tenure
ZipCode
Let's plot these 5 variables against Status and check their relationships again -
scatterplotMatrix(~Status+BaseRate+Age+LastPromotion+ZipCode+Tenure, data=testing)
From the above matrix, it can be concluded that younger employees (40 years or below) with lower base rates (<$20/hr), who have not been promoted in the last 5 years, who live in certain zip codes (possibly implying longer commutes), and who have worked for the organization 5 years or less, tend to leave the organization.
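To turn the model into the retention tool described at the outset, currently active employees can be ranked by their predicted probability of leaving. A sketch using the cross-validated model (the cut-off of 20 employees is arbitrary):

```r
# Score active employees by predicted TERM probability and surface the
# highest-risk ones as a watchlist for the retention campaign.
active <- subset(testing, Status == "ACTIVE")
risk <- predict(model, active, type = "prob")[, "TERM"]
watchlist <- data.frame(EmployeePPPER = active$EmployeePPPER, RiskScore = risk)
head(watchlist[order(-watchlist$RiskScore), ], 20)
```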